Empirical Methods for Exploiting Parallel Texts
Abstract
Parallel translations of written texts have long been useful tools for human students of language, and have begun to serve as an intriguing source of data for corpus-based approaches to natural language processing. A source text and its translation can be viewed as a coarse map between the two languages, and an industrious student or clever computer program may wish to refine that mapping so that it shows which sentences, phrases, and words are translations of one another. Humans are very adept at finding such relations in parallel text. This is true even when one or both of the languages are unfamiliar, as can be seen in a simple but convincing exercise in Knight (1997). While there was considerable early success in automatically identifying sentences in parallel text that are translations of each other (e.g., Brown, Lai, and Mercer, 1991; Gale and Church, 1993), a variety of challenging problems has emerged since that time. Empirical Methods for Exploiting Parallel Texts is a revision of the author’s 1998 Ph.D. dissertation (University of Pennsylvania), and succeeds in capturing the range of problems inherent in parallel text. It presents a variety of techniques for finding translation equivalents and demonstrates that, once these are available, they can be used to align text segments, detect omissions in translations, identify non-compositional compounds, and discriminate among word senses.
Similar resources
Extracting Parallel Corpora from Comparable Documents to Improve Translation Quality in Machine Translation Systems
Data used for training statistical machine translation methods are usually prepared from three resources: parallel, non-parallel, and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most existing methods for exploiting comparable corpora loo...
Empirical Methods for MT Lexicon Development
This article reviews some recently invented methods for automatically extracting translation lexicons from parallel texts. The accuracy of these methods has been significantly improved by exploiting known properties of parallel texts and of particular language pairs. The state of the art has advanced to the point where translations can be found automatically and with high reliability even for no...
Exploiting Parallel Texts for Word Sense Disambiguation: An Empirical Study
A central problem of word sense disambiguation (WSD) is the lack of manually sense-tagged data required for supervised learning. In this paper, we evaluate an approach to automatically acquire sense-tagged training data from English-Chinese parallel corpora, which are then used for disambiguating the nouns in the SENSEVAL-2 English lexical sample task. Our investigation reveals that this method ...
Mining parallel fragments from comparable texts
This paper proposes a novel method for exploiting comparable documents to generate parallel data for machine translation. First, each source document is paired to each sentence of the corresponding target document; second, partial phrase alignments are computed within the paired texts; finally, fragment pairs across linked phrase-pairs are extracted. The algorithm has been tested on two recent ...